Differentiable Unbiased Online Learning to Rank
Online Learning to Rank (OLTR) methods optimize rankers based on user
interactions. State-of-the-art OLTR methods are built specifically for linear
models. Their approaches do not extend well to non-linear models such as neural
networks. We introduce an entirely novel approach to OLTR that constructs a
weighted differentiable pairwise loss after each interaction: Pairwise
Differentiable Gradient Descent (PDGD). PDGD breaks away from the traditional
approach that relies on interleaving or multileaving and extensive sampling of
models to estimate gradients. Instead, its gradient is based on inferring
preferences between document pairs from user clicks and can optimize any
differentiable model. We prove that the gradient of PDGD is unbiased w.r.t.
user document pair preferences. Our experiments on the largest publicly
available Learning to Rank (LTR) datasets show considerable and significant
improvements under all levels of interaction noise. PDGD outperforms existing
OLTR methods in both learning speed and final convergence.
Furthermore, unlike previous OLTR methods, PDGD also allows for non-linear
models to be optimized effectively. Our results show that using a neural
network leads to even better performance at convergence than a linear model. In
summary, PDGD is an efficient and unbiased OLTR approach that provides a better
user experience than previously possible.
Comment: Conference on Information and Knowledge Management 201
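The core idea described above, a weighted differentiable pairwise loss built from click-inferred preferences, can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: the preference pairs and the debiasing weights (rho in the paper) are taken as given here rather than derived from the ranking distribution.

```python
import numpy as np

def pairwise_loss(scores, pref_pairs, weights):
    """Weighted differentiable pairwise loss in the spirit of PDGD.

    scores: model scores for the ranked documents (np.array)
    pref_pairs: (i, j) index pairs where document i was inferred,
        from user clicks, to be preferred over document j
    weights: per-pair debiasing weights (assumed given, not derived)
    """
    loss = 0.0
    for (i, j), w in zip(pref_pairs, weights):
        # Probability that d_i beats d_j under a pairwise softmax.
        p_ij = np.exp(scores[i]) / (np.exp(scores[i]) + np.exp(scores[j]))
        # Maximize the log-probability of each inferred preference.
        loss -= w * np.log(p_ij)
    return loss
```

Because the loss is a differentiable function of the scores, it can be minimized by gradient descent on any differentiable scoring model, linear or neural, which is the property the abstract emphasizes.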
Optimizing Ranking Models in an Online Setting
Online Learning to Rank (OLTR) methods optimize ranking models by directly
interacting with users, which allows them to be very efficient and responsive.
All OLTR methods introduced during the past decade have extended the
original OLTR method: Dueling Bandit Gradient Descent (DBGD). Recently, a
fundamentally different approach was introduced with the Pairwise
Differentiable Gradient Descent (PDGD) algorithm. To date, the only comparisons
of the two approaches are limited to simulations with cascading click models
and low levels of noise. The main outcome so far is that PDGD converges at
higher levels of performance and learns considerably faster than DBGD-based
methods. However, the PDGD algorithm assumes cascading user behavior,
potentially giving it an unfair advantage. Furthermore, the robustness of both
methods to high levels of noise has not been investigated. Therefore, it is
unclear whether the reported advantages of PDGD over DBGD generalize to
different experimental conditions. In this paper, we investigate whether the
previous conclusions about the PDGD and DBGD comparison generalize from ideal
to worst-case circumstances. We do so in two ways. First, we compare the
theoretical properties of PDGD and DBGD, by taking a critical look at
previously proven properties in the context of ranking. Second, we estimate an
upper and lower bound on the performance of methods by simulating both ideal
user behavior and extremely difficult behavior, i.e., almost-random
non-cascading user models. Our findings show that the theoretical bounds of
DBGD do not apply to any common ranking model and, furthermore, that the
performance of DBGD is substantially worse than PDGD in both ideal and
worst-case circumstances. These results reproduce previously published findings
about the relative performance of PDGD vs. DBGD and generalize them to
extremely noisy and non-cascading circumstances.
Comment: European Conference on Information Retrieval (ECIR) 201
Balancing Speed and Quality in Online Learning to Rank for Information Retrieval
In Online Learning to Rank (OLTR) the aim is to find an optimal ranking model
by interacting with users. When learning from user behavior, systems must
interact with users while simultaneously learning from those interactions.
Unlike other Learning to Rank (LTR) settings, existing research in this field
has been limited to linear models. This is due to the speed-quality tradeoff
that arises when selecting models: complex models are more expressive and can
find the best rankings but need more user interactions to do so, a requirement
that risks frustrating users during training. Conversely, simpler models can be
optimized on fewer interactions and thus provide a better user experience, but
they will converge towards suboptimal rankings. This tradeoff creates a
deadlock: novel models cannot improve either the user experience or the final
convergence point without sacrificing the other. Our
contribution is twofold. First, we introduce a fast OLTR model called Sim-MGD
that addresses the speed aspect of the speed-quality tradeoff. Sim-MGD ranks
documents based on similarities with reference documents. It converges rapidly
and hence gives a better user experience, but it does not converge towards the
optimal rankings. Second, we contribute Cascading Multileave Gradient Descent
(C-MGD) for OLTR that directly addresses the speed-quality tradeoff by using a
cascade that enables combinations of the best of two worlds: fast learning and
high quality final convergence. C-MGD can provide the better user experience of
Sim-MGD while maintaining the same convergence as the state-of-the-art MGD
model. This opens the door for future work to design new models for OLTR
without having to deal with the speed-quality tradeoff.
Comment: CIKM 2017, Proceedings of the 2017 ACM on Conference on Information and Knowledge Managemen
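The Sim-MGD idea of ranking by similarity to reference documents can be illustrated with a small sketch. The cosine similarity measure and the function names here are assumptions for illustration; the point is that only one weight per reference document needs to be learned, which is why the model can be optimized on few interactions.

```python
import numpy as np

def sim_scores(docs, refs, w):
    """Sketch of Sim-MGD-style scoring: each document is scored by a
    weighted combination of its similarities to a small set of reference
    documents, so only len(refs) weights need to be learned.
    Cosine similarity is an assumption; the paper may use another measure.
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return [sum(wm * cos(d, r) for wm, r in zip(w, refs)) for d in docs]
```

With few parameters the model learns fast but its expressiveness is capped by the reference set, matching the speed side of the speed-quality tradeoff the abstract describes.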
Policy-Aware Unbiased Learning to Rank for Top-k Rankings
Counterfactual Learning to Rank (LTR) methods optimize ranking systems using
logged user interactions that contain interaction biases. Existing methods are
only unbiased if users are presented with all relevant items in every ranking.
There is currently no counterfactual unbiased LTR method for top-k rankings.
We introduce a novel policy-aware counterfactual estimator for LTR
metrics that can account for the effect of a stochastic logging policy. We
prove that the policy-aware estimator is unbiased if every relevant item has a
non-zero probability to appear in the top-k ranking. Our experimental results
show that the performance of our estimator is not affected by the size of k:
for any k, the policy-aware estimator reaches the same retrieval performance
while learning from top-k feedback as when learning from feedback on the full
ranking. Lastly, we introduce novel extensions of traditional LTR methods to
perform counterfactual LTR and to optimize top-k metrics. Together, our
contributions introduce the first policy-aware unbiased LTR approach that
learns from top-k feedback and optimizes top-k metrics. As a result,
counterfactual LTR is now applicable to the very prevalent top-k ranking
setting in search and recommendation.
Comment: SIGIR 2020 full conference pape
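The policy-aware idea can be sketched as follows: a document's examination propensity is its expected examination probability over the rankings the stochastic logging policy may display, counting only appearances inside the top-k. This is a simplified illustration under assumed position-based examination probabilities, not the paper's full estimator.

```python
def policy_aware_propensity(doc, rankings, probs, pos_bias, k):
    """Sketch of a policy-aware examination propensity.

    rankings: candidate rankings the stochastic logging policy may show
    probs: probability that the policy displays each ranking
    pos_bias: assumed P(user examines rank r) for r = 0..k-1
    """
    rho = 0.0
    for ranking, p in zip(rankings, probs):
        if doc in ranking[:k]:
            # Expectation over the logging policy of being examined in the top-k.
            rho += p * pos_bias[ranking.index(doc)]
    return rho

def ips_estimate(clicks, propensities):
    # Inverse-propensity-scored relevance estimate per document.
    return [c / rho for c, rho in zip(clicks, propensities)]
```

This makes concrete the unbiasedness condition in the abstract: the estimate is only well-defined when every relevant document has a non-zero probability of appearing in the top-k, so that its propensity is positive.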
Learning from User Interactions with Rankings: A Unification of the Field
Ranking systems form the basis for online search engines and recommendation
services. They process large collections of items, for instance web pages or
e-commerce products, and present the user with a small ordered selection. The
goal of a ranking system is to help a user find the items they are looking for
with the least amount of effort. Thus, the rankings they produce should place
the most relevant or preferred items at the top of the ranking. Learning to
rank is a field within machine learning that covers methods which optimize
ranking systems w.r.t. this goal. Traditional supervised learning to rank
methods utilize expert judgements to evaluate and learn; however, in many
situations such judgements are impossible or infeasible to obtain. As a
solution, methods have been introduced that perform learning to rank based on
user clicks instead. The difficulty with clicks is that they are not only
affected by user preferences, but also by what rankings were displayed.
Therefore, these methods have to avoid being biased by factors other than
user preference. This thesis concerns learning to rank methods based on user
clicks and specifically aims to unify the different families of these methods.
As a whole, the second part of this thesis proposes a framework that bridges
many gaps between areas of online, counterfactual, and supervised learning to
rank. It has taken approaches, previously considered independent, and unified
them into a single methodology for widely applicable and effective learning to
rank from user clicks.
Comment: PhD Thesis of Harrie Oosterhuis, defended at the University of Amsterdam on November 27th 202
Taking the Counterfactual Online: Efficient and Unbiased Online Evaluation for Ranking
Counterfactual evaluation can estimate Click-Through-Rate (CTR) differences
between ranking systems based on historical interaction data, while mitigating
the effect of position bias and item-selection bias. We introduce the novel
Logging-Policy Optimization Algorithm (LogOpt), which optimizes the policy for
logging data so that the counterfactual estimate has minimal variance. As
minimizing variance leads to faster convergence, LogOpt increases the
data-efficiency of counterfactual estimation. LogOpt turns the counterfactual
approach - which is indifferent to the logging policy - into an online
approach, where the algorithm decides what rankings to display. We prove that,
as an online evaluation method, LogOpt is unbiased w.r.t. position and
item-selection bias, unlike existing interleaving methods. Furthermore, we
perform large-scale experiments by simulating comparisons between thousands of
rankers. Our results show that while interleaving methods make systematic
errors, LogOpt is as efficient as interleaving without being biased.
Comment: ICTIR 202
Doubly-Robust Estimation for Unbiased Learning-to-Rank from Position-Biased Click Feedback
Clicks on rankings suffer from position bias: items at lower ranks are
generally less likely to be examined - and thus clicked - by users, regardless
of their actual preferences between items. The prevalent approach to unbiased
click-based Learning-to-Rank (LTR) is based on counterfactual
Inverse-Propensity-Scoring (IPS) estimation. Unique about LTR is the fact that
standard Doubly-Robust (DR) estimation - which combines IPS with regression
predictions - is inapplicable since the treatment variable - indicating whether
a user examined an item - cannot be observed in the data. In this paper, we
introduce a novel DR estimator that uses the expectation of treatment per rank
instead. Our novel DR estimator has more robust unbiasedness conditions than
the existing IPS approach, and in addition, provides enormous decreases in
variance: our experimental results indicate it requires several orders of
magnitude fewer datapoints to converge at optimal performance. For the unbiased
LTR field, our DR estimator contributes both increases in state-of-the-art
performance and the most robust theoretical guarantees of all known LTR
estimators.
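The estimator's key move, replacing the unobservable per-interaction examination indicator with its expectation per rank, fits the generic doubly-robust form: a regression prediction plus a propensity-corrected residual. The sketch below shows that generic form under assumed examination probabilities; it is an illustration, not the paper's exact estimator.

```python
def dr_estimate(clicks, exam_prob, reg_pred):
    """Sketch of a doubly-robust per-document relevance estimate.

    clicks: observed 0/1 clicks
    exam_prob: expected probability the user examined the document at its
        logged rank (the expectation of the unobservable treatment)
    reg_pred: a regression model's predicted click probability given
        examination (assumed available)
    """
    # Regression prediction plus an inverse-propensity-weighted residual:
    # if the regression is perfect, the residual is zero in expectation;
    # if the propensities are correct, the residual corrects the regression.
    return [r + (c - ep * r) / ep
            for c, ep, r in zip(clicks, exam_prob, reg_pred)]
```

The double robustness is visible in the formula: the estimate is unbiased when either the regression predictions or the examination probabilities are correct, which is why its conditions are weaker than IPS alone.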
Ranking for Relevance and Display Preferences in Complex Presentation Layouts
Learning to Rank has traditionally considered settings where given the
relevance information of objects, the desired order in which to rank the
objects is clear. However, with today's large variety of users and layouts,
this is not always the case. In this paper, we consider so-called complex ranking
settings where it is not clear what should be displayed, that is, what the
relevant items are, and how they should be displayed, that is, where the most
relevant items should be placed. These ranking settings are complex as they
involve both traditional ranking and inferring the best display order. Existing
learning to rank methods cannot handle such complex ranking settings as they
assume that the display order is known beforehand. To address this gap we
introduce a novel Deep Reinforcement Learning method that is capable of
learning complex rankings, both the layout and the best ranking given the
layout, from weak reward signals. Our proposed method does so by selecting
documents and positions sequentially, hence it ranks both the documents and
positions, which is why we call it the Double-Rank Model (DRM). Our experiments
show that DRM outperforms all existing methods in complex ranking settings,
thus it leads to substantial ranking improvements in cases where the display
order is not known a priori.
Unifying Online and Counterfactual Learning to Rank
Optimizing ranking systems based on user interactions is a well-studied
problem. State-of-the-art methods for optimizing ranking systems based on user
interactions are divided into online approaches - that learn by directly
interacting with users - and counterfactual approaches - that learn from
historical interactions. Existing online methods are hindered without online
interventions and thus should not be applied counterfactually. Conversely,
counterfactual methods cannot directly benefit from online interventions. We
propose a novel intervention-aware estimator for both counterfactual and online
Learning to Rank (LTR). With the introduction of the intervention-aware
estimator, we aim to bridge the online/counterfactual LTR division as it is
shown to be highly effective in both online and counterfactual scenarios. The
estimator corrects for the effect of position bias, trust bias, and
item-selection bias by using corrections based on the behavior of the logging
policy and on online interventions: changes to the logging policy made during
the gathering of click data. Our experimental results, conducted in a
semi-synthetic experimental setup, show that, unlike existing counterfactual
LTR methods, the intervention-aware estimator can greatly benefit from online
interventions.
Comment: Harrie Oosterhuis and Maarten de Rijke. 2021. Unifying Online and
Counterfactual Learning to Rank: A Novel Counterfactual Estimator that
Effectively Utilizes Online Interventions. In The 14th ACM International
Conference on Web Search and Data Mining (WSDM '21), March 8-12, 2021,
Jerusalem, Israel. ACM, New York, NY, USA, 9 pages.
https://doi.org/10.1145/3437963.344179
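The abstract's central mechanism, corrections based on the behavior of every logging policy deployed during data gathering, can be sketched as a mixture propensity. This is a simplified illustration under assumed exposure probabilities, not the paper's exact estimator.

```python
def intervention_aware_propensity(exposure_per_policy, data_fraction):
    """Sketch of an intervention-aware examination propensity.

    exposure_per_policy: the document's assumed exposure probability under
        each logging policy deployed over time (interventions change the
        policy mid-collection)
    data_fraction: the share of logged interactions gathered under each
        policy; fractions should sum to 1
    """
    # The effective propensity is the exposure expectation over the whole
    # sequence of deployed policies, not just the initial one.
    return sum(e * f for e, f in zip(exposure_per_policy, data_fraction))
```

Weighting by the share of data each policy logged is what lets the estimator benefit from online interventions: an intervention that raises a document's exposure directly raises its effective propensity and lowers the variance of its correction.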